1. INTRODUCTION

One of the most important phases in the creation of high-quality software is software testing.
However, software testing activities are highly expensive [1]. Numerous studies have found that most
software errors are caused by only a few of the software modules [2], [3]. To reduce the resources
required for testing a software project, the testing team can, prior to testing, use software defect
prediction (SDP) tools to predict defect-prone modules [4]. Various studies have shown that SDP under
the within-project defect prediction (WPDP) method, i.e., where the historical defect data used for
training and testing are extracted from the same project, can effectively identify defect-prone
modules [4]. WPDP faces an obvious challenge when a software project is new or has limited historical
defect data [5]. In response to this challenge, cross-project defect prediction (CPDP) was introduced,
i.e., where the historical defect data of one project (source) are used to predict defect-prone
modules in another project (target), as described by Nam et al. [6]. However, although the potential
usefulness of CPDP has been validated by Ma et al. [7], Peters et al. [8], and Rahman et al. [9], CPDP
has poor predictive performance because of the differences between the source and target projects in
terms of programming languages, coding styles, and developers' experience. Another factor that affects
the predictive performance of CPDP is the weakness of the individual machine learning algorithm used
for building the prediction model [10]. To address the challenge of poor prediction performance in
CPDP, various studies have proposed different approaches: the study in [1] proposed transfer
cost-sensitive boosting, Liu et al. [4] proposed two-phase transfer learning, He et al. [5] proposed a
simplified training data method, Sun et al. [11] proposed a filtering method for selecting a suitable
source project for CPDP, Chen et al. [12] proposed deep learning based CPDP, Zhao et al. [13] proposed
manifold feature transformation for CPDP, and Lei et al. [14] proposed CPDP based on feature selection
and distance-weight transfer learning. However, the predictive performance of CPDP is still below an
applicable level and therefore needs improvement.

To improve the predictive performance of CPDP, this study proposes a similar attribute selection
and multi-learning method (SASMLM) for cross-project defect prediction. Specifically, SASMLM aims to
improve the predictive performance of CPDP by minimizing both the difference between the source and
target projects and the weakness of the individual classifier. To determine the effectiveness of the
proposed method, we conducted experiments using eight open-source datasets from AEEEM and Relink. The
results showed that the proposed SASMLM achieved better F1_score results than the baseline CPDP
approaches on both the individual-dataset and average scales. To further confirm that this result was
not obtained by chance, we statistically compared the prediction performance of SASMLM against each
baseline method using the Wilcoxon signed-rank test. The statistical results indicated that the
proposed SASMLM significantly outperformed all the baseline methods. This study contributes in three
ways. It provides: i) a method for reducing the difference between source and target projects, ii) a
method for reducing the impact of a weak classifier on the performance of cross-project defect
prediction, and iii) a detailed analysis of the impact of using multiple classifiers rather than a
single classifier when building a model for cross-project defect prediction.


2. RELATED WORK 

Briand et al. [15] conducted a feasibility study on the CPDP technique, in which an object-oriented
project dataset was used to train a machine learning classifier for CPDP. Motivated by this,
Zimmermann et al. [16] experimented extensively with 622 cross-project defect predictions. The
outcomes showed that just 3.4% of these predictions were effective. They concluded that reducing the
disparity in distribution between the source and target projects could enhance CPDP's performance. In
addition, the predictive performance of CPDP can be enhanced by selecting appropriate machine learning
algorithms. As a result, several studies were conducted to improve CPDP's predictive ability.
Selecting the attributes (features) that are closest to one another is one effective strategy for
reducing the difference between two projects. Watanabe et al. [17], focusing on the attribute level,
suggested transforming the training set for CPDP using the mean distributional characteristics of
attributes from both the source and target projects. Nam et al. [6] proposed TCA+, an extension of
transfer component analysis (TCA), which reduces the distribution difference and improves CPDP's
predictive performance by transforming the attributes of the source and target projects into a latent
space. Zhao et al. [13] proposed a manifold-based feature transformation CPDP strategy. The Naive
Bayes (NB) classifier was trained using the transformed source project, and the trained NB was then
used for the CPDP. Yuan et al. [18] proposed a two-phase method for selecting a suitable CPDP training
set. The attributes of the source and target projects were first grouped using a clustering method. To
select the appropriate training data for the CPDP, the local density of features, feature class
relevance, and feature similarity were then taken into consideration.

In addition, to lessen the gap between the source and target projects, some studies concentrate on
instance selection. Turhan et al. [19] proposed an approach that uses K-nearest neighbor (KNN) to
select the most similar instances in the source and target projects. He et al. [20] selected relevant
instances with regard to defect count for CPDP using various similarity techniques. Ma et al. [7]
suggested a method in which each instance in the source project was given a weight based on how
similar it was to the target project. A classification model for CPDP was then trained using the
weighted instances. Peters et al. [8] proposed the Peters filter, a method for selecting suitable
training data for CPDP using the K-means clustering technique. Herbold [21] suggested using the EM
clustering algorithm and KNN to find similar instances that could be used as CPDP training data.
Amasaki et al. [22] suggested a strategy in which the appropriate instances for training the CPDP
model were obtained using the density-based spatial clustering of applications with noise (DBSCAN)
method. Li et al. [23] suggested a method in which a dictionary was used to learn the instances that
were similar between the source and target projects. Ni et al. [24] proposed a two-step method for
CPDP. First, similar instances between the source and target projects were chosen using a clustering
technique. Second, the limited labels in the target project were used in conjunction with TrAdaBoost
to select instances from the source project.


Bulletin of Electr Eng & Inf, ISSN: 2302-9285, Vol. 13, No. 3, June 2024: 2027-2035


3. METHOD 
3.1. Framework 

A CPDP model is trained on the previous records (dataset) of one software project (source) to
predict defects in another project (target), which mitigates the issue of limited or missing previous
records. The main challenge is the poor performance resulting from differences between projects and
the weakness of the individual classifier. Selecting the most similar attributes between the projects
for training and adopting a stacking technique instead of a single individual classifier can improve
the performance of the CPDP method.

Based on this, this study proposes the SASMLM method. As shown in Figure 1, the method consists of
two main parts. In the similar attribute selection part, two different software projects are taken as
inputs, one serving as source S and the other as target T. The mean difference (MD) between each pair
of corresponding features in the source and target datasets is computed, and the features with the
minimum MD are extracted from the source and target datasets, denoted as the training set (TrS) and
testing set (TS) for training and testing, respectively. In the multi-learning part, in the first
stage, a set of base classifiers, namely KNN, support vector machine (SVM), and random forest (RF),
are trained on TrS and tested on TS. In the second stage, the logistic regression (LR) meta-classifier
is learned from the output of the KNN, SVM, and RF, and the learned meta-classifier makes the final
prediction on TS. Finally, the F1_score is used to evaluate the predictive performance of the proposed
method.


[Figure 1: block diagram. Recoverable labels: Source, Similar attribute selection, Training set (TrS),
Training, Performance evaluation.]

Figure 1. Framework for SASMLM 


3.2. Algorithm description 

In the initialization phase (lines 1 to 12 in Algorithm 1), the mean $\bar{f}^{(S)}$ of each feature
$f^{(S)}$ in the source project S and the mean $\bar{f}^{(T)}$ of each feature $f^{(T)}$ in the target
project T are computed. To find the similar features between the source and target projects, we
compute the absolute MD between each feature in the source and the corresponding feature in the target
project; for example, the mean of feature 1 in the source project is subtracted from the mean of
feature 1 in the target project. The absolute values of MD are then arranged in ascending order. We
select the top N features in MD (those with the smallest differences) and use them as the common
features (CF) between the source and target projects. We then select all features in the source
project that are the same as the features in CF and use them as TrS; similarly, all features in the
target project that are the same as the features in CF are selected and used as TS.

In the training phase (lines 14 to 16 in Algorithm 1), we employ a method similar to stacking:
KNN, SVM, and RF are trained on TrS and tested on TS. The prediction results of the three models,
evaluated using F1_score, are then used as the training set for LR. In the final prediction phase
(lines 17 to 19 in Algorithm 1), the trained LR model is used to predict defects in the TS. The
prediction performance of the LR model is evaluated using F1_score.
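
The selection step described above can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not the authors' implementation: the function name
select_similar_features, the dictionary-of-columns data layout, and the toy feature names (loc, churn,
entropy) are all hypothetical, and the source and target are assumed to share the same feature names.

```python
from statistics import mean


def select_similar_features(source, target, n):
    """Return the n features whose source and target means are closest.

    source and target map a feature name to its list of values
    (hypothetical layout chosen for this sketch).
    """
    # Absolute mean difference (MD) for every feature present in both projects.
    md = {
        name: abs(mean(source[name]) - mean(target[name]))
        for name in source
        if name in target
    }
    # Rank MD in ascending order and keep the top n as common features (CF).
    cf = sorted(md, key=md.get)[:n]
    # Project both datasets onto CF to obtain TrS and TS.
    trs = {name: source[name] for name in cf}
    ts = {name: target[name] for name in cf}
    return cf, trs, ts


# Toy data, invented purely for illustration.
source = {"loc": [10, 20, 30], "churn": [1, 2, 3], "entropy": [0.2, 0.4, 0.6]}
target = {"loc": [100, 200, 300], "churn": [2, 2, 2], "entropy": [0.3, 0.5, 0.7]}

cf, trs, ts = select_similar_features(source, target, n=2)
print(cf)
```

In this toy case the means of churn are identical and those of entropy differ by about 0.1, so these
two features are kept and loc, whose means differ by 180, is discarded. Because MD is computed on raw
values, features measured on large scales tend to be penalized; whether any rescaling is applied
beforehand is not stated in the text.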

Algorithm 1. Proposed SASMLM

Input: S - Source
       T - Target
       N - The number of selected similar features
Output: F1_score

1: Get all the features in S, mark as $f^{(S)} = \{f_1^{(S)}, f_2^{(S)}, \ldots, f_n^{(S)}\}$;
2: Get all the features in T, mark as $f^{(T)} = \{f_1^{(T)}, f_2^{(T)}, \ldots, f_n^{(T)}\}$;
3: for each $f^{(S)}$ and $f^{(T)}$
4:    Compute the mean, mark as $\bar{f}^{(S)}$ and $\bar{f}^{(T)}$;
5:    for each pair of $\bar{f}^{(S)}$ and $\bar{f}^{(T)}$
6:       Compute the absolute difference, marked as $MD = \{|\bar{f}_1^{(S)} - \bar{f}_1^{(T)}|, |\bar{f}_2^{(S)} - \bar{f}_2^{(T)}|, \ldots, |\bar{f}_n^{(S)} - \bar{f}_n^{(T)}|\}$;
7:    end for
8:    Rank the values in MD in ascending order;
9:    Select N top features in MD, mark as CF;
10:   Transpose CF;
11:   Select all features from $f^{(S)}$ that correspond with features in transpose CF, mark as Training Set (TrS);
12:   Select all features from $f^{(T)}$ that correspond with features in transpose CF, mark as Testing Set (TS);
13: end for
14: Train the base models (KNN, SVM and RF) on TrS;
15: Test the trained base models on the TS and calculate their performance in F1_score;
16: Use the trained base models as features to train the second level model (LR);
17: Use the trained LR to perform a final prediction on the TS;
18: Compute the performance of LR using F1_score;
19: Return F1_score;
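
For readers who want to experiment with the multi-learning part, the following is a minimal sketch of
conventional stacking with KNN, SVM, and RF as base learners and LR as the meta-learner, using
scikit-learn's StackingClassifier. It is an assumption-laden illustration rather than the authors'
code: the data are synthetic (not AEEEM or Relink), the hyperparameters are library defaults apart from
those shown, and StackingClassifier trains the meta-learner on cross-validated predictions over the
training set, which is the standard stacking procedure and may differ from the exact steps in lines 14
to 17 of Algorithm 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-ins for TrS and TS; a real CPDP run would use two different projects.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_trs, y_trs = X[:300], y[:300]
X_ts, y_ts = X[300:], y[300:]

# First level: the three base classifiers named in the text.
base_learners = [
    ("knn", KNeighborsClassifier()),
    ("svm", SVC()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Second level: logistic regression learns from the base learners' outputs.
model = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5,
)
model.fit(X_trs, y_trs)

# Final prediction and evaluation on the held-out set.
pred = model.predict(X_ts)
score = f1_score(y_ts, pred)
print(f"F1_score on the synthetic test set: {score:.2f}")
```

Dropping an entry from base_learners gives the two-classifier and one-classifier variants that the
later experiments on the number and combination of learners compare.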


3.3. Experimental setup 

To evaluate the proposed SASMLM systematically, we established the following research
questions: i) RQ1: Does the proposed SASMLM perform better in cross-project defect prediction than the
compared CPDP methods?; ii) RQ2: How does the number of learners affect the prediction
performance of SASMLM?; and iii) RQ3: How do different combinations of learners affect the
prediction performance of SASMLM?


3.4. Dataset 

In this study, we conducted CPDP on eight commonly used open-source datasets from AEEEM
and Relink. Table 1 provides details of these projects. The AEEEM dataset was collected by D'Ambros et
al. [25]. It consists of five projects. Each project in AEEEM has a total of 71 attributes in
different categories, including entropy of change, entropy of source code, source code,
previous-defect metrics, and churn of source code. The Relink dataset was collected by Wu et al. [26].
It consists of three projects, each with 26 attributes.


Table 1. Datasets 


Project name   Modules   Features   Defective modules
EQ             325       71         129
JDT            997       71         206
LC             399       71         64
ML             1862      71         245
PDE            1492      71         209
Apache         194       26         98
Safe           56        26         22
ZXing          399       26         118


3.5. Evaluation measures 
To evaluate the prediction performance of our approach, we used the F-measure, which is one of the
most widely used evaluation measures in SDP [27], [28]. The F-measure is the harmonic mean of
precision and recall. They are defined as follows, where TP, FP, and FN denote the numbers of true
positives, false positives, and false negatives, respectively:
- Precision: the ratio of software modules that were accurately predicted to be non-defective
to all modules that were predicted to be non-defective.

$$\text{Precision} = \frac{TP}{TP + FP} \quad (1)$$

- Recall: the ratio of the number of modules that were accurately predicted to be non-defective
to the total number of non-defective modules.

$$\text{Recall} = \frac{TP}{TP + FN} \quad (2)$$

- F-measure: the harmonic mean of precision and recall. A higher F-measure means
better performance.

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3)$$
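
As a worked example of (1) to (3), the short Python sketch below computes the three measures from
confusion-matrix counts. The counts used (TP = 40, FP = 10, FN = 20) are hypothetical and chosen only
to make the arithmetic easy to follow, and returning 0.0 when a denominator is zero is a convention
assumed here, not something the text specifies.

```python
def precision(tp, fp):
    """Equation (1): TP / (TP + FP); 0.0 if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (2): TP / (TP + FN); 0.0 if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(tp, fp, fn):
    """Equation (3): harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, for illustration only.
tp, fp, fn = 40, 10, 20
print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")
print(f"F-measure = {f_measure(tp, fp, fn):.3f}")
```

With these counts, precision is 40/50 = 0.800, recall is 40/60 = 0.667, and the F-measure is
2 x 0.800 x 0.667 / (0.800 + 0.667) = 0.727, which lies between the two but closer to the smaller
value, as expected of a harmonic mean.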


3.6. Baselines 

We first experimentally compared SASMLM with four baseline CPDP methods to determine the
effectiveness of our proposed approach. The baseline methods are TCA+, MFTCPDP, Watanabe, and Burak,
which were proposed by Nam et al. [6], Zhao et al. [13], Watanabe et al. [17], and Turhan et al. [19],
respectively. Experiments were conducted on the AEEEM and Relink datasets. The experimental results
were measured using the F-measure.


3.7. Evaluation settings 

To conform with previous studies in SDP, we arranged all the datasets in AEEEM and
Relink in pairs. For instance, when EQ was used as the source, each of the other projects in AEEEM was
used as a target, i.e., EQ → JDT, ML, PDE, and LC. Across the two benchmark datasets, we conducted 26
CPDP experiments: 20 from AEEEM and 6 from Relink. For the evaluation of the results, we used the
F-measure (F1_score).
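
The pairing scheme can be enumerated with a short Python sketch. The helper name cross_project_pairs
is hypothetical; the project names are those listed in Table 1, and pairs are formed only within each
benchmark because AEEEM and Relink have different feature sets.

```python
from itertools import permutations

AEEEM = ["EQ", "JDT", "LC", "ML", "PDE"]
RELINK = ["Apache", "Safe", "ZXing"]


def cross_project_pairs(projects):
    """All ordered (source, target) pairs with source different from target."""
    return list(permutations(projects, 2))


aeeem_pairs = cross_project_pairs(AEEEM)
relink_pairs = cross_project_pairs(RELINK)
all_pairs = aeeem_pairs + relink_pairs

for source, target in all_pairs:
    print(f"{source} -> {target}")
print(len(aeeem_pairs), len(relink_pairs), len(all_pairs))
```

Five projects give 5 x 4 = 20 ordered pairs and three projects give 3 x 2 = 6, which accounts for the
26 experiments.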


3.8. Statistical significance test 

To evaluate whether SASMLM performed significantly better than the compared baseline methods, we
followed existing SDP studies [7], [24], [29] and employed the nonparametric Wilcoxon signed-rank test
at a confidence level of 95%. This test is used to determine whether there is a statistically
significant difference between two related samples.


4. RESULTS AND DISCUSSION 
4.1. Effectiveness of SASMLM 

RQ1: Does the proposed SASMLM achieve better cross-project defect prediction results than the
baseline CPDP methods?

The comparison results of SASMLM against the baseline CPDP methods are reported in Tables 2 and 3.
From these tables, we can observe that SASMLM achieves better prediction results than the baseline
methods in most of the combinations based on F1_score. On the AEEEM datasets, SASMLM won 15, 11, 17,
and 18 out of 20 dataset combinations against MFTCPDP, TCA+, Watanabe, and Burak, respectively. On the
Relink datasets, SASMLM won 3, 4, 6, and 5 out of 6 dataset combinations against MFTCPDP, TCA+,
Watanabe, and Burak, respectively. Also, considering the average results across the 26 prediction
combinations, SASMLM performed better than the baseline methods, with means of 0.82 and 0.73 on AEEEM
and Relink, respectively. To further confirm that the achievement of SASMLM is not by chance, we
analysed the prediction performance of SASMLM statistically. Like previous studies [1], [30], [31], we
established the following hypotheses:

a. H1_0: SASMLM does not achieve better prediction performance than the four baseline CPDP methods.
b. H1_a: SASMLM can achieve better prediction performance than the four baseline CPDP methods.

We constructed four groups of Wilcoxon signed-rank tests: SASMLM and MFTCPDP, SASMLM and
TCA+, SASMLM and Watanabe, and SASMLM and Burak. At the 95% confidence level, the p-values obtained
were 0.002 for SASMLM and MFTCPDP, 0.000 for SASMLM and TCA+, 0.000 for SASMLM and Watanabe, and 0.000
for SASMLM and Burak. All four p-values are less than the significance threshold of 0.05, which
indicates that the proposed SASMLM significantly outperformed all the compared approaches.
Therefore, we rejected H1_0 and accepted H1_a. In other words, SASMLM can achieve better F-measure
results with statistical significance. A better F-measure indicates that SASMLM can be effective for
predicting defects across different projects.

Table 2. Comparison of SASMLM to baseline methods on AEEEM dataset based on F1_score

Source→Target   MFTCPDP   TCA+   Watanabe   Burak   SASMLM
EQ→JDT          0.53      0.46   0.74       0.43    0.87
EQ→LC           0.84      0.48   0.02       0.81    0.95
EQ→ML           0.76      0.39   0.35       0.59    0.94
EQ→PDE          0.72      0.62   0.76       0.53    0.75
JDT→EQ          0.56      0.61   0.69       0.45    0.71
JDT→LC          0.89      0.80   0.83       0.86    0.75
JDT→ML          0.84      0.77   0.78       0.81    0.75
JDT→PDE         0.83      0.80   0.78       0.79    0.88
LC→EQ           0.67      0.55   0.68       0.47    0.65
LC→JDT          0.57      0.73   0.82       0.70    0.73
LC→ML           0.78      0.47   0.81       0.81    0.87
LC→PDE          0.78      0.80   0.81       0.80    0.90
ML→EQ           0.62      0.64   0.68       0.45    0.64
ML→JDT          0.61      0.73   0.83       0.70    0.88
ML→LC           0.88      0.86   0.86       0.86    0.88
ML→PDE          0.82      0.82   0.82       0.79    0.92
PDE→EQ          0.63      0.61   0.69       0.45    0.60
PDE→JDT         0.72      0.74   0.83       0.70    0.85
PDE→LC          0.88      0.79   0.81       0.86    0.92
PDE→ML          0.83      0.79   0.81       0.81    0.91
Mean            0.74      0.67   0.72       0.68    0.82


Table 3. Comparison of SASMLM to baseline methods on Relink dataset based on F1_score

Source→Target   MFTCPDP   TCA+   Watanabe   Burak   SASMLM
Safe→Apache     0.71      0.67   0.72       0.57    0.71
ZXing→Apache    0.65      0.67   0.71       0.36    0.63
ZXing→Safe      0.77      0.51   0.72       0.74    0.70
Apache→Safe     0.46      0.57   0.72       0.23    0.78
Apache→ZXing    0.59      0.60   0.65       0.16    0.78
Safe→ZXing      0.65      0.63   0.64       0.60    0.78
Mean            0.64      0.61   0.69       0.44    0.73
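
To make the mechanics of the Wilcoxon signed-rank test concrete, the sketch below applies it in plain
Python to the six Relink pairs in Table 3, comparing the SASMLM and Burak columns. It is an
illustration of the procedure only and does not reproduce the p-values reported above, which come from
the authors' own analysis; the helper names signed_rank and exact_two_sided_p are hypothetical. With
only six pairs the exact test has very coarse resolution (the smallest attainable two-sided p-value is
2/64), so the number it prints should not be compared with the reported ones.

```python
from itertools import product

# F1_score columns copied from Table 3 (Relink), in row order.
sasmlm = [0.71, 0.63, 0.70, 0.78, 0.78, 0.78]
burak = [0.57, 0.36, 0.74, 0.23, 0.16, 0.60]


def signed_rank(x, y):
    """Return (W+, W-, ranks) for paired samples x and y."""
    # Paired differences; rounding removes floating-point noise, zeros are dropped.
    diffs = [round(a - b, 10) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Group tied absolute differences and give them the average rank.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, ranks


def exact_two_sided_p(ranks, w_observed):
    """Exact two-sided p-value by enumerating every sign assignment."""
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= w_observed:
            extreme += 1
    return extreme / 2 ** len(ranks)


w_plus, w_minus, ranks = signed_rank(sasmlm, burak)
p_value = exact_two_sided_p(ranks, min(w_plus, w_minus))
print(w_plus, w_minus, p_value)
```

For larger samples, a library routine such as scipy.stats.wilcoxon would normally be used instead of
enumerating all sign assignments.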


4.2. Effect of the number of classifiers (S) on prediction performance of SASMLM 


RQ2: How does the number of classifiers affect the prediction performance of SASMLM?

To investigate the effect of different numbers of classifiers S on the prediction performance of
SASMLM, we conducted cross-project defect prediction experiments using different numbers of
classifiers: S=1 (one classifier), S=2 (two classifiers), and S=3 (three classifiers), as shown in
Table 4. The average results across the AEEEM project combinations are reported in the last row of the
table. From the table, we can see that the prediction performance of SASMLM is slightly better when
S=3 (the number of classifiers used as base learners) than the average performance of the other
variants.


Table 4. Results of SASMLM based on different number of learners on AEEEM datasets

Source→Target   S=3    S=2    S=1
EQ→JDT          0.71   0.74   0.67
EQ→LC           0.60   0.54   0.48
EQ→ML           0.65   0.59   0.66
EQ→PDE          0.64   0.57   0.52
JDT→EQ          0.73   0.73   0.74
JDT→LC          0.87   0.84   0.88
JDT→ML          0.85   0.80   0.85
JDT→PDE         0.88   0.87   0.88
LC→EQ           0.75   0.75   0.75
LC→JDT          0.88   0.88   0.88
LC→ML           0.90   0.93   0.93
LC→PDE          0.92   0.92   0.92
ML→EQ           0.75   0.75   0.75
ML→JDT          0.88   0.88   0.88
ML→LC           0.95   0.95   0.95
ML→PDE          0.92   0.92   0.92
PDE→EQ          0.75   0.75   0.75
PDE→JDT         0.87   0.88   0.88
PDE→LC          0.94   0.94   0.95
PDE→ML          0.91   0.93   0.93
Mean            0.82   0.81   0.81


4.3. Effect of the different combinations of classifiers on prediction performance of SASMLM

RQ3: How do different combinations of classifiers affect the prediction performance of SASMLM?

To answer this question, we conducted an experimental investigation on the performance of
SASMLM using a combination of KNN and RF, KNN and SVM, and RF and SVM as base learners on the
AEEEM datasets. The results obtained, as shown in Table 5, show that SASMLM achieved the highest
average F1_score (0.83) using KNN and RF as base learners compared to the other combinations.
Therefore, this result shows that KNN and RF are suitable as base learners for SDP. This agrees with
the findings of Akour et al. [32] and Goyal et al. [33] on RF and KNN, respectively.


Table 5. Results of SASMLM based on combination of different classifiers used as base learners on AEEEM datasets

Source→Target   KNN and RF   KNN and SVM   RF and SVM
EQ→JDT          0.74         0.78          0.71
EQ→ML           0.73         0.49          0.55
EQ→PDE          0.65         0.54          0.51
EQ→LC           0.85         0.39          0.39
JDT→EQ          0.72         0.74          0.74
JDT→ML          0.72         0.84          0.85
JDT→PDE         0.85         0.88          0.87
JDT→LC          0.89         0.83          0.81
ML→EQ           0.75         0.75          0.75
ML→JDT          0.88         0.88          0.88
ML→PDE          0.92         0.92          0.92
ML→LC           0.95         0.94          0.95
PDE→EQ          0.75         0.75          0.75
PDE→JDT         0.88         0.88          0.88
PDE→ML          0.93         0.93          0.92
PDE→LC          0.95         0.94          0.94
LC→EQ           0.75         0.75          0.75
LC→JDT          0.88         0.88          0.88
LC→ML           0.92         0.93          0.93
LC→PDE          0.92         0.92          0.92
Mean            0.83         0.80          0.80


5. CONCLUSION 

In this study, we proposed a new method, called SASMLM, for improving the predictive performance
of CPDP. SASMLM first computes the mean vector of all attributes in both the source and target
projects and selects the most similar ones. The selected attributes are then used to train the base
learners, and the trained base learners are tested on the target project. The results of the base
learners are used to train the meta-learner, and the trained meta-learner is used for the final
prediction. We conducted experiments to evaluate SASMLM on eight projects from two benchmark datasets,
using the F-measure as the evaluation metric. The non-parametric Wilcoxon signed-rank test was used to
evaluate the significance of the difference between SASMLM and the four compared methods. The results
show that SASMLM can achieve better prediction performance than competing CPDP methods. Our findings,
therefore, show that SASMLM can be useful for SDP, especially when there is a shortage of historical
records for a particular project. In future studies, we intend to extend this study by using more
classifiers as base learners. Adding new classifiers may help the approach predict more defects.